Investigating school absenteeism and refusal among Australian children and adolescents using Apriori association rule mining

Identifying the many reasons behind students' school absences is often challenging. This study aims to uncover the hidden reasons for school absence in children and adolescents. The analysis is conducted on a national survey that includes 2967 Australian children and adolescents aged 11–17. The Apriori association rule mining algorithm and binary logistic regression are used to identify the significant predictors of school absences. Out of 2484, 83.7% (n = 2079) of children and adolescents aged 11–17 years have missed school for various reasons: 42.28% (n = 879) are 11–15 years old, while 24.52% (n = 609) and 16.9% (n = 420) are 16- and 17-year-old adolescents, respectively. A considerable proportion of adolescents, specifically 16.4% (n = 407) and 23.4% (n = 486) of 16- and 17-year-olds, respectively, have selected ‘refused to say’ as their reason for not attending school. The study also highlights the negative outcomes associated with undisclosed reasons for school absence, such as bullying, excessive internet/gaming, reduced family involvement, suicide attempts, and existential hopelessness. The findings of the national survey underscore the importance of addressing these undisclosed reasons for school absence to improve the overall well-being and educational outcomes of children and adolescents.

Furthermore, a study conducted on young people aged 10-17 who had been diagnosed or treated for school refusal behaviour between 1994 and 1998 at the Rivendell Unit in Sydney, Australia, found a high prevalence of mood and disruptive behaviour disorders 22. Chi-square and ANOVA tests were used to analyse the data in that study. Although numerous studies have been conducted on the topic of school refusal and absenteeism, the majority of them have focused on Europe, Asia, the United States, and Canada, with only a few carried out in Australia. This imbalance in research has left methodological gaps in the existing evidence.
Many of the earlier studies have primarily concentrated on high school students, specifically 9th graders, making it difficult to obtain accurate statistics. Moreover, past research has relied on ML or statistical methodologies to identify specific behaviours associated with these issues, mainly for predictive modelling and classification. However, these methodologies do not investigate students' behaviour and activities to determine the genuineness of their reasons for absences and the underlying factors contributing to this phenomenon.
Additionally, most studies have relied on information from clinical referrals or discussions when exploring the topic of absenteeism. Consequently, there is a lack of research utilizing large, nationally representative datasets to examine this problem. In particular, there is a lack of research using association rule mining to investigate students' behaviour and activities, which can provide valuable insights into the genuineness of their reasons for absences and the underlying factors contributing to this phenomenon.
Association rule mining is an effective technique for uncovering patterns and relationships in large datasets 23,24. By identifying frequent itemsets and association rules based on co-occurrence relationships, this method allows for the discovery of hidden patterns and associations that may not be apparent through other techniques. When it comes to school absences, association rule mining can help reveal interesting relationships between different factors contributing to absences and provide valuable insights into the underlying reasons behind them.
Despite the effectiveness of association rule mining in uncovering interesting associations or patterns in data, it has not been utilized in any previous studies of this problem. Hence, this study aims to employ association rule mining to identify the genuine reasons for school absences and to pinpoint the stage at which absence develops into school refusal. Given the absence of prior research employing a large dataset to examine this phenomenon, the present study aims to ascertain the underlying factors contributing to these behaviours by analysing data derived from Young Minds Matter (YMM), a nationwide survey in Australia that focuses on mental health and overall well-being. Overall, this study investigates how association rule mining can be applied to the large volume of YMM data to discover hidden information, create potentially meaningful patterns, and extract the most relevant features related to school refusal and absenteeism, in order to identify which children refuse to attend school, the reasons for their absence, and the underlying factors involved. To accomplish this, the study has utilized the Apriori algorithm, a widely recognized machine learning algorithm for association rule mining [25][26][27][28][29][30][31]. This algorithm has been widely employed in various fields such as hypothesis testing, numerical analysis, and large-scale data processing 32,33.

Results
The analysis begins with the question, 'What was the primary reason for missing school?'. The YMM dataset provides data on 2484 children who did not attend school. Of these, 1639 children were sick, 256 had medical appointments, 33 had family members who were sick, 1 child faced a parental work conflict, 10 lacked transportation, 128 did not want to go to school, 154 had family events, and 263 had other reasons. Figure 1 categorizes these children by age and the aforementioned reasons. This information is crucial in understanding why students are absent from school.
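The counts above can be turned into percentage shares with a few lines of Python. This is only an illustrative sketch: the dictionary keys are shortened labels chosen here, and the percentages are computed from the quoted counts rather than taken from the paper.

```python
# Counts of primary reasons for missing school, as quoted in the text.
# The key names are shortened labels chosen for this sketch.
reason_counts = {
    "Sick": 1639,
    "Medical appointment": 256,
    "Family member sick": 33,
    "Parental work conflict": 1,
    "No transportation": 10,
    "Did not want to go to school": 128,
    "Family event": 154,
    "Other reasons": 263,
}

total = sum(reason_counts.values())


def reason_shares(counts):
    """Return each reason's percentage share of all absences, rounded to 2 dp."""
    n = sum(counts.values())
    return {reason: round(100 * c / n, 2) for reason, c in counts.items()}


shares = reason_shares(reason_counts)
for reason, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{reason:32s} {reason_counts[reason]:5d}  {pct:6.2f}%")
```

The eight counts sum to the 2484 absences reported, and the 'did not want to go to school' group accounts for roughly 5% of them.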
Figure 1 shows that the students who missed school are aged between 11 and 17 years. Notably, a significant percentage of students (12%, 41%, and 23%) in the 15-17 age group express a lack of motivation to attend school. To gain deeper insights into this resistance, the data are analysed using the Apriori association rule mining technique. This technique identifies patterns and relationships among the reasons for missing school, shedding light on the underlying factors contributing to non-attendance.
When conducting association rule mining, antecedents and consequents are determined based on statistically significant relationships between variables in the dataset. The specific antecedents and consequents can vary depending on the research question and the analysis being conducted.
In this analysis, the Apriori algorithm has been applied to the YMM dataset to identify associations between factors related to students' lack of interest in attending school. The algorithm selects rules with higher lift and conviction values, indicating the strength and reliability of the associations. The associated factors of disinterest in going to school lead to interesting sub-issues related to research objectives, outlined in Table 1.
Table 1 presents the associated factors related to disinterest in going to school. The Apriori algorithm is first applied with 'Not interested in going to school' as the consequent. It uncovers two strongly associated antecedents: 'Felt life was not worth living' (lift: 2.98, conviction: 1.35) and 'Easily distracted' (lift: 2.99, conviction: 1.36).
However, it is important to note that the identification of antecedents and consequents does not imply a one-way relationship. Rather, it suggests that the presence of the antecedents increases the likelihood of observing the consequents. Additionally, these antecedents can themselves be influenced by other factors, which is why this analysis continues to explore associations with these identified antecedents.
This analysis is continued to investigate the underlying causes of these associated factors whenever a strong association is discovered. For example, 'Bullied by others' is explored and found to have 'Spend less time with family' (lift: 3.16, conviction: 1.84) as a strongly associated antecedent.
Another consequent, 'Spend less time with family', is significantly linked to the antecedent 'Do you feel bothered when you can't be on the internet/play electronic games?' (lift: 2.26, conviction: 1.41). This factor, in turn, is associated with 'Go without eating/sleeping because of internet or electronic game' (lift: 2.30, conviction: 1.34) and 'Spend less time with family' (lift: 2.00, conviction: 1.24). Association rules are retained when their lift and conviction values are greater than 1, which indicates a meaningful rule, with the minimum confidence level set at 34.8% in Table 1.
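The relationship between the reported measures can be checked with a short derivation. The sketch below assumes the standard definitions, in which lift is confidence divided by the support of the consequent and conviction is the ratio of (1 minus the support of the consequent) to (1 minus confidence). It is a rough consistency check, not a calculation reported by the authors.

```latex
% Assumed standard definition: Lift = Confidence / Support(Y),
% so Support(Y) = Confidence / Lift.
\[
\mathrm{Conviction}(X \rightarrow Y)
  = \frac{1 - \mathrm{Support}(Y)}{1 - \mathrm{Confidence}(X \rightarrow Y)}
  = \frac{1 - \mathrm{Confidence}(X \rightarrow Y) / \mathrm{Lift}(X \rightarrow Y)}
         {1 - \mathrm{Confidence}(X \rightarrow Y)}
\]

% Illustrative check with Confidence = 0.348 (the stated minimum) and Lift = 2.98:
\[
\frac{1 - 0.348 / 2.98}{1 - 0.348} = \frac{0.8832}{0.652} \approx 1.35
\]
```

The result is close to the conviction of 1.35 reported alongside a lift of 2.98 for 'Felt life was not worth living', which is consistent with, though it does not establish, that rule having a confidence near the 34.8% threshold.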
The identification of an item as both an antecedent and a consequent can occur when there are strong relationships between multiple variables in the dataset. This circular relationship can be a result of complex interactions among various factors influencing the behaviour under investigation.
Based on the values of lift and conviction from the resulting antecedents, the underlying causes of new antecedents such as being bullied by others, attempting suicide, spending less time with family, worrying a lot, feeling restless, feeling angry and feeling bothered when not on the internet/playing electronic games have been investigated. These factors are found to have a significant impact on children's emotional and behavioural issues, as shown in Fig. 2.
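The chained exploration described above, where each strong antecedent is treated as a new consequent, can be sketched as a graph walk that tolerates circular rules. The rule values below are those quoted in the text. The item labels are shortened, the function name is hypothetical, and reading the two rules after "This factor, in turn, is associated with" as antecedents of the internet/gaming item is an assumption.

```python
from collections import deque

# Rules quoted in the text, stored as consequent -> [(antecedent, lift, conviction)].
# Assumption: the last two rules are read as antecedents of the internet/gaming item.
RULES = {
    "Not interested in going to school": [
        ("Felt life was not worth living", 2.98, 1.35),
        ("Easily distracted", 2.99, 1.36),
    ],
    "Bullied by others": [("Spend less time with family", 3.16, 1.84)],
    "Spend less time with family": [
        ("Bothered when not on internet/games", 2.26, 1.41),
    ],
    "Bothered when not on internet/games": [
        ("Go without eating/sleeping because of internet/games", 2.30, 1.34),
        ("Spend less time with family", 2.00, 1.24),
    ],
}


def trace_antecedents(start, rules, min_lift=1.0, min_conviction=1.0):
    """Breadth-first walk from a consequent through its antecedents.

    Each item is visited once, so circular rules cannot cause an endless loop.
    Returns the items reached, in visiting order (excluding the start item).
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        consequent = queue.popleft()
        for antecedent, lift, conviction in rules.get(consequent, []):
            if lift <= min_lift or conviction <= min_conviction:
                continue  # keep only rules with lift and conviction above 1
            if antecedent not in seen:
                seen.add(antecedent)
                order.append(antecedent)
                queue.append(antecedent)
    return order


print(trace_antecedents("Bullied by others", RULES))
```

Tracking visited items is the design choice that matters here: 'Spend less time with family' appears as both antecedent and consequent, so a naive recursive walk would not terminate.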
Figure 2 indicates that a significant proportion of children and adolescents who have been absent due to illness have also encountered additional challenges or issues. Specifically, 23% (n = 377) of them have reported being victims of bullying, while 69.37% (n = 1137) have displayed a dependency on electronic games or excessive internet use. Additionally, 66.50% (n = 1090) of them have shown a lack of prioritization when it comes to spending time with their families. This behaviour could potentially be attributed to their engagement in gaming or excessive internet use, or to their reluctance to reveal their emotional state resulting from bullying experiences. These reasons are also evident in other cases. For children and adolescents who missed school for a doctor's appointment, 26.56% (n = 68) are bullied and 62.5% (n = 160) have reported developing dependencies on the internet or on playing electronic games. Although 263 children and adolescents have stated that they had other reasons for missing school, they did not specify whether bullying, internet addiction, electronic games, and spending less time with family are contributing factors for their school absence. The summary shown in Fig. 2 provides the answer: 20.53% (n = 54), 65.78% (n = 173), and 61.59% (n = 162) are affected by bullying, internet/electronic game addiction, and a lack of time spent with family members, respectively. These factors have been found to be significant reasons for school absences, as demonstrated by age in Table 2.
Table 2 highlights the significant prevalence of bullying incidents among children and adolescents who have experienced school absences. The highest percentage of bullying incidents is observed among 11-year-old children at 35.48% (n = 99). Additionally, a significant percentage of children experiencing school absences, ranging from 61 to 71%, develop addictions to internet usage or electronic gaming. Among 15-year-old adolescents, the highest percentage of 75.36% (n = 208) is observed, and they also report spending less time with their families. Moreover, a significant proportion of these children (23.15%; n = 575) have expressed a belief that life is not worth living. They have also developed unhealthy habits such as skipping meals or lacking sufficient sleep. This issue is particularly prominent among the age group of 15 to 17 years old, with percentages of 37.68% (n = 104), 34.48% (n = 210), and 35% (n = 147) respectively.
To examine the association between school absences and various factors, the Apriori algorithm has been used. While this analysis has identified several potential factors related to school absences, it is important to note that association does not necessarily imply causation. To determine the best predicted factors and explain school absences among children, a multivariate approach, specifically binary logistic regression, has been employed.

Multivariate analysis
After identifying the contributing factors for school absences by uncovering the underlying pattern of the variables, a determination has been made regarding their significance. In order to conduct a multivariate analysis, a binary logistic regression has been employed 34. All potential factors identified through the Apriori algorithm analysis have been used as independent variables in the binary logistic regression. The coefficients and odds ratios have been examined at the 5% significance level to investigate the strength of these relationships.
It is worth noting that a few of these factors do not reach significance based on the conventional 95% confidence interval. In this analysis, the dependent variable is whether a child or adolescent missed school, represented by '1' for absences and '0' for attendance. The estimates, odds ratios (OR), and 95% confidence intervals (CI) can be found in Table 3.
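The link between a logistic regression estimate, its odds ratio, and the 95% confidence interval can be sketched as follows. The sketch assumes Wald-type intervals, which the text does not confirm, and the function names are hypothetical. The interval used in the example is the one quoted in this paper for internet/gaming dependency (OR 1.29, 95% CI 1.06 to 1.58).

```python
import math

Z_95 = 1.96  # two-sided 5% critical value of the standard normal distribution


def odds_ratio_ci(beta, se, z=Z_95):
    """Odds ratio and Wald confidence interval from a logit coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coef_from_ci(lower, upper, z=Z_95):
    """Recover the coefficient and SE implied by a reported Wald interval for the OR."""
    beta = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


def is_significant(lower, upper):
    """At the 5% level, an OR is significant when its 95% CI excludes 1."""
    return lower > 1.0 or upper < 1.0


# Interval quoted in the text for internet/gaming dependency: OR 1.29 (1.06, 1.58).
beta, se = coef_from_ci(1.06, 1.58)
or_, lo, hi = odds_ratio_ci(beta, se)
print(f"beta={beta:.3f}, se={se:.3f}, OR={or_:.2f}, CI=({lo:.2f}, {hi:.2f})")
print("significant:", is_significant(lo, hi))
```

Because a Wald interval is symmetric on the log scale, the geometric mean of the two bounds should reproduce the reported odds ratio, which gives a quick check on published values.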
Table 3 presents the outcomes of a binary logistic regression analysis, which investigates the relationships between school absenteeism and the various factors identified through the Apriori algorithm analysis. Based on the results presented in Table 3, it can be observed that children and adolescents who have developed

Discussion
The children and adolescents have provided specific reasons for their school absences in this study. By using the Apriori algorithm on a large dataset from YMM, Australia's recent nationally representative survey, this study has identified 10 associated factors out of 534 variables related to school absenteeism. Notably, bullying, addiction to internet/electronic games, spending less time with family, suicide attempts, and feelings of hopelessness have been found through association rule mining to be significant factors contributing to school absences among Australian children and adolescents. Some of these associated factors have been determined to be significant through the implementation of binary logistic regression analysis. The analysis reveals that while some of the significant factors from association rule mining do not reach statistical significance at the 5% level, they still provide meaningful insights into the relationships and patterns within the dataset. Association rule mining evaluates these associations based on strength and reliability measures such as lift and conviction.
It is worth noting that association rule mining can uncover other significant relationships and patterns in the dataset, even if they do not meet the strict criteria for statistical significance. The emphasis is placed on the strength and reliability of the connections between variables as indicated by lift and conviction values. Therefore, associations identified through association rule mining should still be considered meaningful and valuable, as they provide insights into the dataset, regardless of their statistical significance. Overall, this research both confirms and expands on previous findings in this area 2,[35][36][37].
Previous studies have mainly focused on mental disorders 17,22,38 and limited aspects of school functioning, such as teacher's behaviour 39, interaction [40][41][42], and safety [43][44][45], while overlooking factors like bullying, internet/game addiction, lack of family time, and feelings of hopelessness. Interestingly, students have not consistently disclosed these factors as reasons for their absences. The use of association rule mining has uncovered hidden information, suggesting that students may develop disinterest or aversion towards school, eventually leading to school refusal.
Furthermore, this study has identified the age groups most impacted by bullying and internet/electronic game addiction, with the highest percentage observed among individuals aged 11 and 15, respectively. The study has also revealed a significant prevalence of suicidal ideation, skipping meals and sleep among students, particularly prominent among individuals aged 15-17. These findings demonstrate the potential of association rule mining to uncover hidden information and gain deeper insights into the reasons behind school absenteeism and school refusal.
Unlike previous studies that rely on existing literature or use a limited number of variables and participants, this research is based on a comprehensive Australian national dataset, including children and adolescents aged 11-17, capturing a crucial period in their academic development. The large and diverse sample enhances the applicability of the findings to a wider population.
The results highlight the importance of parents, teachers, and school authorities being aware of these significant factors contributing to school refusal or absenteeism, as they have a detrimental impact on students' learning. It is observed that while other children attend school, these particular children express a desire to stay at home to engage in internet browsing or play electronic or online games. In accordance with existing research, the results of this study have shown that this reliance has negative consequences such as aggressive behaviour, social isolation, a loss of sense of reality, and health issues such as vision loss and hearing problems [46][47][48]. Additionally, attention should be given to the content these children access on the internet, particularly concerning issues like pornography, violence, terror, or gambling, as they can contribute to unethical thoughts and behaviours that are harmful to both the children and society 49.

Limitation of the study
There are a few limitations that need to be acknowledged in regard to the study. Firstly, it is important to note that the sample used in this study is limited to Australia. Therefore, the findings and conclusions may not be applicable to other countries or populations. However, it is worth mentioning that the study has analysed a comprehensive Australian national dataset, which included children and adolescents aged 11-17 years. The large sample size and diverse range of participants enhance the potential generalizability of the research findings. Additionally, it is important to recognize that the study relied on yes-no categorical variables. While this approach may not fully capture the complexities of the factors contributing to school absenteeism and the development of a school refusal attitude, it does provide a straightforward and clear method for examining the presence or absence of certain factors related to school absenteeism. This simplification aids the analysis process and can lead to actionable recommendations. Another limitation to consider is that the research excluded 'Unknown' categories, which could potentially result in the loss of valuable information and influence the findings and conclusions. Nevertheless, the outcomes of this model illustrate the effectiveness of the data building template in determining the factors associated with school refusal and absenteeism behaviours.

Conclusion
Attending school is essential for learning, for expanding options, and for improving overall chances of success. Therefore, it is essential to identify the causes of school refusal and absenteeism behaviour in children and adolescents. In this study, Apriori has proven to be an efficient association rule generator for determining the associated factors of school refusal and absenteeism behaviour using YMM, a high-dimensional dataset of children and adolescents' mental health in Australia. Moreover, the results from the logistic regression model reveal that being bullied, being bothered when unable to use the internet or play electronic games, attempting suicide, and feeling that life is not worth living are the most significant factors for missing school. Surprisingly, children and adolescents did not include these as reasons for school absence. Furthermore, Apriori identifies several other characteristics related to school refusal and absenteeism behaviour in children and adolescents, such as restlessness, being easily distracted, anger, and worrying, although these are not statistically significant in logistic regression. The serious implications of school refusal and absenteeism for a student's future prospects, including lower incomes, higher unemployment rates, and compromised health, make it imperative for parents, teachers, and school officials to understand the significance of these newly identified contributing factors. By taking these factors into account, school attendance can be prioritized as a fundamental concern.

Materials and methods
The phenomenon of school refusal and absenteeism among children and adolescents is a multifaceted problem that is influenced by various causes. In order to understand and address this behaviour, a comprehensive model has been developed to examine the underlying causes. The framework of the study analysis is shown in Fig. 3.

Dataset
In this study, the factors responsible for school refusal and absenteeism of the children and adolescents have been detected using YMM, a nationwide cross-sectional Australian dataset organized by the University of Western Australia (UWA) for the Telethon Kids Institute. It is funded by the Australian Department of Health 50. This dataset is available by submitting a request to the Australian Data Archive (ADA) at https://dataverse.ada.edu.au. The data collection process has received ethical approval from the Human Research Ethics Committees of AGDH and UWA 50,51. YMM data has been collected using a multi-stage, area-based random sample technique. It has been designed to be representative of Australian families with children aged 4-17. If a family had more than one eligible child, the survey was given to one of them at random. A total of 6310 parents/carers (55% of eligible households) of children and adolescents aged 4-17 voluntarily participated in the study.

Data processing
Variable [...] difficulties. Categories like 'Not at all' and 'Never' are replaced with 'No' to capture the absence or lack of something. This grouping of similar responses into binary categories creates a more manageable dataset that can be easily interpreted by the model. The aim is to capture the underlying patterns and relationships between variables, rather than focusing on the specific values themselves. While binary representation may not capture the nuances of the original responses, it is a trade-off made to simplify the analysis and enhance the model's ability to generalize and make accurate predictions. This approach allows for the identification and understanding of significant patterns and trends, even if some detailed information about the original values is sacrificed.
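The recoding described above can be sketched as a simple lookup. Only the mapping of 'Not at all' and 'Never' to 'No' is stated in the text; the set of responses treated as 'Yes' below, and the function name, are hypothetical placeholders chosen for illustration.

```python
# Stated in the text: 'Not at all' and 'Never' are recoded to 'No'.
NO_RESPONSES = {"Not at all", "Never", "No"}

# Assumption: illustrative labels only; the actual 'Yes' groupings are not
# given in the text.
YES_RESPONSES = {"Yes", "Sometimes", "Often", "A little", "A lot"}


def to_binary(response):
    """Collapse a survey response to 'Yes'/'No'; anything unrecognised is 'Unknown'."""
    if response in NO_RESPONSES:
        return "No"
    if response in YES_RESPONSES:
        return "Yes"
    return "Unknown"


answers = ["Never", "Often", "Not at all", "A little", "Refused"]
print([to_binary(a) for a in answers])
```

Routing unrecognised responses to 'Unknown' rather than guessing keeps the trade-off explicit: nuance is lost on purpose, but nothing is silently recoded.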
Any category with more than 2000 'Unknown' values is excluded from the analysis. Out of the remaining variables, 533 categorical variables with 'Yes'/'No' have been selected. Additionally, 3 categorical variables (named 'year of school', 'main reasons of missing school' and 'age') with multiple values (where year of school and age are quantitative variables) have also been selected. In total, 536 variables have been selected from the original dataset, which initially comprised 680 variables. The column values have been converted to numeric values using the factorize() function to encode the string variables.
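A minimal sketch of these two steps with pandas is given below. The toy frame, the column names, the helper names, and the lowered threshold in the example are assumptions made for illustration; only the 2000 'Unknown' limit and the use of factorize() come from the text.

```python
import pandas as pd

UNKNOWN_LIMIT = 2000  # threshold stated in the text


def drop_mostly_unknown(df, limit=UNKNOWN_LIMIT):
    """Drop every column whose count of 'Unknown' values exceeds the limit."""
    unknown_counts = (df == "Unknown").sum()
    return df.loc[:, unknown_counts <= limit]


def encode_strings(df):
    """Integer-encode each column with pandas.factorize.

    Codes follow the order of first appearance in each column, so they carry
    no meaning beyond identity.
    """
    return df.apply(lambda col: pd.factorize(col)[0])


# Toy frame (not YMM data) with a lowered limit so the filter is visible on four rows.
toy = pd.DataFrame({
    "bullied": ["Yes", "No", "No", "Yes"],
    "mostly_missing": ["Unknown", "Unknown", "Unknown", "Yes"],
    "age": ["11", "15", "11", "17"],
})
kept = drop_mostly_unknown(toy, limit=2)
encoded = encode_strings(kept)
print(encoded)
```

Because factorize() numbers values in order of first appearance, 'Yes' may receive code 0 in one column and 1 in another, so an explicit mapping is safer wherever the 0/1 meaning matters.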

Dummy variable creation
Dealing with multiple values in the data input can pose challenges for the model's ability to accurately comprehend and interpret the data. This can result in the model failing to recognize recurring patterns and treating them as separate entities, leading to inaccurate forecasts. To address this issue, it is recommended to use dummy variables, which effectively represent different categories, especially when dealing with numerous instances in the input characteristics. This approach enhances the model's understanding and assimilation of the data, ultimately leading to more precise predictions. To simplify the process of uncovering associations between variables, the pandas.get_dummies() function has been used, and each variable in the dataset has been coded as either 'Yes' or 'No' with corresponding numerical values. Thus, a dummy variable has been created for each potential value, where 0 signifies 'No', and 1 signifies 'Yes'.
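The dummy coding can be illustrated with pandas.get_dummies() on a toy frame. The column names and values below are made up for the example and are not taken from YMM.

```python
import pandas as pd

# Toy data; column names and values are illustrative, not taken from YMM.
toy = pd.DataFrame({
    "bullied": ["Yes", "No", "Yes"],
    "reason": ["Sick", "Did not want to go", "Family event"],
})

# One 0/1 column per observed category value, e.g. 'bullied_Yes', 'reason_Sick'.
dummies = pd.get_dummies(toy, columns=["bullied", "reason"], dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

Each multi-valued column expands into one indicator column per category, which is the item format that association rule mining works on.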

Target variable
The variable pertaining to the question 'What was the primary reason for missing school?' has eight categories that explain the reasons for missing school. These categories include sickness, doctor's appointments, family members' sickness, conflicts with parental work, lack of transportation, lack of interest in attending school, family events, and other reasons. Dummy variables have been created for each of these categories. In order to analyse the causes of school absenteeism and refusal among Australian children and adolescents, the category of 'lack of interest in attending school' has been selected as the target variable.

Methodology
Python 3.7.3 and the scikit-learn package have been used to create a machine learning model using the association rule learning technique. Specifically, the Apriori algorithm, which is a well-known algorithm for association rule mining, has been applied to discover the variables that frequently occur together and contribute to certain behaviours in the YMM dataset.

Association rule mining
Association rule mining is a method used to uncover important patterns and associations in large datasets 23,52. It involves identifying correlations between items, events, or variables and generating rules that capture these associations. The aim is to extract rules that express relationships between various items in the dataset, typically in the form of 'if-then' statements, where the antecedent (if-part) represents the presence of certain items or events, and the consequent (then-part) represents the occurrence of other items or events 53. This feature of item association discovery, along with its ability to be applicable across different domains and its lack of prior assumptions, makes association rule mining an invaluable tool in data mining and analytics.

Apriori
The Apriori algorithm is widely recognized as the primary method for association rule mining and discovering new patterns of association 32,33,54. In this study, Apriori has been used to analyse patterns of student behaviour, specifically in identifying relationships between different reasons for missing school. By using the Apriori algorithm, frequently occurring combinations of absence reasons are identified, which provide insights into associations between variables related to school absenteeism and refusal attitude in the YMM dataset. The Apriori steps followed in this study 55 are itemset generation, rule generation, application of the Apriori principle, rule selection with the Apriori algorithm, and identification of maximal and closed frequent itemsets.

Performance measure
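To evaluate the performance of the method, four metrics are calculated: support, confidence, lift, and conviction 56. In the equations below, X is the antecedent, Y is the consequent, and N is the total number of records.

Support: Support indicates the frequency of an item or itemset appearing in the dataset. The support for the combination of X and Y is given by:

$$\mathrm{Support}(X \rightarrow Y) = \frac{\mathrm{freq}(X, Y)}{N}$$

Confidence: Confidence measures the reliability of a rule. It is the conditional probability of the consequent (Y) given the antecedent (X):

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\mathrm{Support}(X \rightarrow Y)}{\mathrm{Support}(X)}$$

Lift: Lift quantifies the strength of association between the components of a rule, relative to what would be expected if they were independent:

$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Support}(X \rightarrow Y)}{\mathrm{Support}(X)\,\mathrm{Support}(Y)}$$

Conviction: Conviction compares how often X would occur without Y if the two were independent with how often X actually occurs without Y:

$$\mathrm{Conviction}(X \rightarrow Y) = \frac{1 - \mathrm{Support}(Y)}{1 - \mathrm{Confidence}(X \rightarrow Y)}$$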
The Apriori algorithm uses these metrics to evaluate the strength and likelihood of association between the rule body (antecedent) and the rule head (consequent). Support refers to the proportion of transactions in the dataset that contain both the rule body and the rule head, confidence measures the likelihood of the rule head occurring given that the rule body has occurred, lift measures the strength of association, and conviction measures how much more often the rule would fail if the rule body and the rule head were independent. If support is low but lift and conviction are greater than 1, it suggests that although the rule occurs infrequently in the dataset, there is a strong association between the rule body and the rule head. High lift and conviction indicate that the occurrence of the rule body has a positive effect on the occurrence of the rule head, even if the overall support for the rule is low.
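The four measures can be computed directly from a list of records, as in the sketch below. It assumes the standard definitions of support, confidence, lift, and conviction. The ten respondents are made up for the example and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift and conviction of the rule antecedent -> consequent.

    Assumes the antecedent occurs at least once (otherwise confidence is undefined).
    """
    supp_x = support(transactions, antecedent)
    supp_y = support(transactions, consequent)
    supp_xy = support(transactions, set(antecedent) | set(consequent))
    conf = supp_xy / supp_x
    lift = conf / supp_y
    # Conviction is unbounded when the rule never fails (confidence = 1).
    conviction = float("inf") if conf == 1 else (1 - supp_y) / (1 - conf)
    return {"support": supp_xy, "confidence": conf, "lift": lift, "conviction": conviction}


# Ten made-up respondents; each set holds the items answered 'Yes'.
transactions = [
    {"bullied", "no_interest"}, {"bullied", "no_interest"}, {"bullied"},
    {"no_interest"}, set(), set(), set(), set(), set(), set(),
]
metrics = rule_metrics(transactions, {"bullied"}, {"no_interest"})
print(metrics)
```

In this toy data the rule covers only 20% of respondents, yet its lift and conviction are both above 2, which illustrates how a low-support rule can still indicate a strong association.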
To ensure a high degree of accuracy and strong relationships between variables, a minimum support value of 3% (min_support = 0.03) has been set. This parameter is used by the Apriori method to reduce the number of candidate rules by establishing a lower bound for the support of the generated association rules.
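How a minimum support threshold prunes candidates can be seen in a compact, pure-Python version of the level-wise Apriori search. This is a sketch of the general algorithm, not the implementation used in the study. The toy transactions are invented, and the example uses a threshold of 0.5 rather than 0.03 so that pruning is visible on four records.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support=0.03):
    """Level-wise Apriori search: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = supp(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    frozenset({"bullied", "less_family_time"}),
    frozenset({"bullied", "less_family_time", "no_interest"}),
    frozenset({"less_family_time"}),
    frozenset({"no_interest"}),
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Raising min_support shrinks the candidate set at every level, because any itemset containing an infrequent subset is discarded without its support being counted.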
It is important to note that association does not imply causality, despite the multiple connections between predictors of school absenteeism in children and adolescents uncovered through association rule mining. Therefore, a multivariate methodology has been employed to determine the optimal predictive variables and explain the phenomenon of school absenteeism. Estimates, odds ratios, and confidence intervals have been used to assess statistical significance of the findings.

1. Which children refuse to attend school?
2. What are the reasons for their absence?
3. Most importantly, what are the underlying factors contributing to school absenteeism among children and adolescents, at what point does it transition into school refusal, and is there anything that parents, teachers, or school officials should be aware of?

Figure 1. Students who missed school for various reasons by age.

Figure 2. Students who miss school for various reasons influenced by associated factors.

Figure 3. Functional pattern of the proposed method.
Scientific Reports | (2024) 14:1907 | https://doi.org/10.1038/s41598-024-51230-4 | www.nature.com/scientificreports/

(a) Itemset generation: Identification of frequently occurring variables. Example: if X and Y are two variables, then (X, Y) is a representation of the list of all items which form the association rule.
(b) Rule generation: Finding interesting patterns and trends between variables. Example: (X → Y) is a representation of finding Y in all items which have X in them.
(c) Apriori principle: Construction of all subsets of frequently occurring variables by dividing them into two components, antecedent and consequent.
(d) Apriori algorithm: Cleaning the deductive rules and selecting the association rules based on interestingness measures such as support, confidence, lift and conviction.
(e) Maximal frequent itemset: Identification of the frequently encountered variables such that none of their immediate supersets are frequently encountered.
(f) Closed frequent itemset: Identification of frequently occurring variables such that no superset of them has the same support value.
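The last two steps, maximal and closed frequent itemsets, can be sketched as filters over a table of frequent itemsets and their supports. The definitions used are the standard ones (no frequent proper superset, and no proper superset with equal support, respectively). The supports below are hypothetical and the function names are chosen for this illustration.

```python
def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s for s, supp in frequent.items()
        if not any(s < other and frequent[other] == supp for other in frequent)
    }


# Hypothetical supports for illustration only.
frequent = {
    frozenset({"bullied"}): 0.5,
    frozenset({"less_family_time"}): 0.75,
    frozenset({"no_interest"}): 0.5,
    frozenset({"bullied", "less_family_time"}): 0.5,
}
print(sorted(map(sorted, maximal_itemsets(frequent))))
print(sorted(map(sorted, closed_itemsets(frequent))))
```

Every maximal itemset is also closed, so the closed set is the larger summary. It preserves the support of every frequent itemset, whereas the maximal set only records which itemsets are frequent.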

Table 1. Associated factors related to the disinterest in going to school.

Table 2. Influence of school missing factors by age.

Children and adolescents who have developed dependencies on internet usage or electronic gaming have approximately 1.29 times the odds of being absent from school compared to their counterparts (OR: 1.29, 95% CI: 1.06, 1.58). Other significant factors associated with school absenteeism include suicide attempts and the belief that life is not worth living. Children who have attempted suicide and those who express feelings that life is not worth living have 1.66 times (OR: 1.66, 95% CI: 1.20, 2.31) and 1.74 times (OR: 1.74, 95% CI: 1.20, 2.52) the odds of missing school compared to their respective counterparts.

Table 3. Binary logistic regression results for school absence across the various factors.